....[1] for the generation of software pipelined schedules. Modulo scheduling [26] is a class of software pipelining algorithms that is
very cost effective and has been implemented in many production compilers. Most of the early modulo scheduling techniques 
focused mainly on achieving high throughput [1 1 7, 25, 28]. However, one of the drawbacks of modulo scheduling (and
software pipelining in general) is that it increases the register requirements. This has motivated some recent modulo scheduling
approaches that not only try to maximize throughput but also try to minimize register requirements [6, 9, 16, .... 

....forcing a node in a particular cycle, the heuristic ejects nodes that cause resource conflicts with the forced node. If for a 
particular resource conflict several candidate nodes are possible, the heuristic selects the one that was first placed in the partial 
schedule S. Other iterative algorithms [6, 16, 28] eject all the operations that cause a resource conflict. In our iterative 
algorithm, only one is ejected. The heuristic also ejects all previously scheduled predecessors and successors whose 
dependence constraints are violated due to the enforced placement. Notice that all the unscheduled .... 
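
The ejection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the resource model (each operation occupies one unit of one resource for a single cycle), the data layout, and every name (`PartialSchedule`, `force`, `capacity`, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PartialSchedule:
    """Toy partial modulo schedule S (all names are illustrative assumptions)."""
    ii: int             # initiation interval
    capacity: dict      # resource -> number of identical units
    resource_of: dict   # node -> resource it occupies for one cycle
    deps: list          # (src, dst, latency, distance) dependence edges
    cycle_of: dict = field(default_factory=dict)  # node -> issue cycle
    order: list = field(default_factory=list)     # nodes, oldest placement first

    def _users(self, resource, row):
        """Scheduled nodes using `resource` in modulo row `row`, oldest first."""
        return [n for n in self.order
                if self.resource_of[n] == resource
                and self.cycle_of[n] % self.ii == row]

    def _eject(self, node):
        del self.cycle_of[node]
        self.order.remove(node)

    def force(self, node, cycle):
        """Force an unscheduled `node` into `cycle`; return the ejected nodes."""
        ejected = []
        resource = self.resource_of[node]
        users = self._users(resource, cycle % self.ii)
        if len(users) >= self.capacity[resource]:
            # Several candidates may conflict: eject only the one placed first.
            self._eject(users[0])
            ejected.append(users[0])
        self.cycle_of[node] = cycle
        self.order.append(node)
        # Eject scheduled neighbours whose dependence with `node` is now violated.
        for src, dst, latency, distance in self.deps:
            if node not in (src, dst) or src == dst:
                continue
            other = dst if src == node else src
            if other not in self.cycle_of:
                continue
            if self.cycle_of[dst] + distance * self.ii < self.cycle_of[src] + latency:
                self._eject(other)
                ejected.append(other)
        return ejected


s = PartialSchedule(
    ii=2,
    capacity={"alu": 2, "mem": 1},
    resource_of={"a": "alu", "b": "alu", "c": "alu", "ld": "mem"},
    deps=[("ld", "c", 3, 0)],
)
s.force("a", 0)
s.force("b", 2)   # "a" and "b" now fill both ALUs in modulo row 0
s.force("ld", 3)
ejected = s.force("c", 4)   # row 0 is full, and "ld" finishes too late for "c"
```

In this run, forcing `c` ejects `a` (the older of the two conflicting ALU operations) and `ld` (its result would arrive after cycle 4), while `b` stays in place.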
....of the register pressure, the pressure on the register buses, and the resource constraints for each cluster. MII is the maximum
of the initiation interval due to resources and the initiation interval due to recurrences; these two values are
computed as in [32]. Then, instructions are scheduled according to their computed cluster assignment. If an instruction cannot
be scheduled in the assigned cluster, the instruction is moved to a different cluster. If an instruction cannot be scheduled in any 
cluster, the II is increased, the partition is modified .... 

where [equation garbled in extraction], and where the quantities involved are the number of communications necessary to schedule the
partition, the number of buses in the architecture, and the latency of the buses. To compute [symbol garbled in extraction]
we proceed as in [32], but also take into account the latency of the edges between instructions in different clusters.
Then, assuming [condition garbled in extraction], we try to find a suitable slot for each node. Since the pseudo schedule needs to be computed as
accurately as possible, nodes are scheduled using the same .... 
....scheduling technique able to exploit this ILP out of a loop by overlapping operations from various successive loop iterations.
Different approaches have been proposed in the literature [2] for the generation of software pipelined schedules. Some of them 
mainly focus on achieving high throughput [1, 13, 18, 25, 26, 28]. This work has been supported by the Ministry of Education
of Spain under contract TIC 98 51 1, and by CEPBA (European Center for Parallelism of Barcelona). Javier Zalamea holds a grant from
the Agencia Española de Cooperación Internacional. Register allocation consists in finding the final ....

....paper presents a novel approach for register spilling in modulo scheduled loops. In this approach, instruction scheduling, 
register allocation, and register spilling are performed simultaneously in the same step. To achieve this, it uses the ability of some
previous iterative modulo scheduling techniques [12, 13, 17, 28] to backtrack, i.e. to undo previous scheduling decisions
and reschedule operations. In order to have reasonably low spill code requirements, MIRS is based on HRMS (Hypernode
Reduction Modulo Scheduling) [23], a register-sensitive modulo scheduler. Our proposal is compared with the ideal case ....
....of values [4, 6]. In IA-64, on the other hand, unrolling of the kernel loop is unnecessary because rotating registers can
be used to perform renaming of the registers, thus reducing the code size [5, 6, 7]. The Intel IA-64 compiler uses a
software pipelining algorithm called modulo scheduling [8]. In modulo scheduling, a minimum candidate II is computed prior
to scheduling. This candidate II is the maximum of the resource-constrained minimum II and the recurrence-constrained
(dependence cycle constrained) minimum II. [Figure 25 labels: prolog, epilog, kernel loop] Figure 25: Execution phases in ....
....functions to achieve suitable performance [2]. Priority functions are widely used and tied to complicated factors. A
non-exhaustive list of examples, just in compilation, includes list scheduling [9], clustered scheduling [14], hyperblock
formation [12], meld scheduling [1], modulo scheduling [1?], and register allocation [6]. GP's representation appears ideal
for improving priority functions. We have tested this observation via two case studies: predication and register allocation.
Predication Studies show that branch instructions account for nearly 20% of all instructions executed in a ....
....provide greater improvement on floating point benchmarks as compared to integer benchmarks. However, floating point
benchmarks are highly loop intensive and inter-region dangles are less of a problem, since most of the performance-critical
dangles occur at the back edges. Modulo scheduling of loops [10 13] is capable of handling these dangles during
scheduling. Loop unrolling provides similar benefits. In our experiments, loops were unrolled eight times, reducing the impact of
inter-region dangles on the overall performance. For example, the average size of the superblocks was 88 operations for ....
....functions to achieve suitable performance [2]. Priority functions are widely used and tied to complicated factors. A
non-exhaustive list of examples, just in compilation, includes list scheduling [9], clustered scheduling [14], hyperblock
formation [12], meld scheduling [1], modulo scheduling [1?], and register allocation [6]. GP's representation appears ideal
for improving priority functions. We have tested this observation via two case studies: predication and register allocation. 4
Predication Studies show that branch instructions account for nearly 20% of all instructions executed in ....
....scheduling, the data dependency graph is useful for guiding optimizations that consider critical path lengths in a program.
2.3.3.2.4 Modulo scheduling and Rotating register allocations The loop scheduling consists of two modules: the modulo scheduler 
and the stage scheduler. The modulo scheduler [41], [50] allocates resources for the loop kernel subject to an initiation
interval. The stage scheduler moves operations across stages in order to reduce register usage in the loop. When a loop is 
modulo scheduled, some of the virtual registers in the loop are designated as rotating registers 

....on live range splitting (FBS+FBR). The percentages of savings are shown in the last column. 4.5 Live range split for predicated
codes Predication [27] has been included in EPIC-style architectures and provides many opportunities for ILP optimization to the
compiler. It enables modulo scheduling [41] to reduce code expansion and to be scheduled with kernel-only code. More
com....

Benchmark      BASE   FBS    FBS+FBR   1 - FBS/BASE (%)   1 - (FBS+FBR)/FBS (%)
008.espresso   1487   744    717       49.97              3.63
023.eqntott    359    251    239       30.08              4.78
072.sc         352    115    115       67.33              0.00
085.gcc        7431   2312   2078      68.89              10.12
....
....this paper we use the formation of predicated hyperblocks as a case study. Meld scheduling: Abraham et al. rely on a priority
function to schedule across region boundaries [1]. The priority function is used to sort regions by the order in which they should be
visited. Modulo scheduling: In [19], Rau states, "As is the case for acyclic list scheduling, there is a limitless number of
priority functions that can be devised for modulo scheduling." Rau describes the tradeoffs involved when considering
scheduling priorities. Register allocation: Many register allocation algorithms use cost ....
....cycles. The objective of a software pipelining method is to construct a schedule that has a high computation rate, or
equivalently a low II. In the past, resource-constrained software pipelining has been studied extensively by several
researchers and a number of modulo scheduling algorithms [7, 13, 18, 22] have been proposed. For a comprehensive
survey of software pipelining methods the reader is referred to [21]. This paper presents a new power-aware software pipelining
method for VLIW architectures, which can minimize the power consumption of software pipelined loops without sacrificing .... 

....schedule using integer linear programming formulation. Our work does not require such hardware support for frequency and
voltage scaling. A relevant work in power-aware software pipelining is by Yun et al. [28]. This work introduces certain
modifications to the iterative modulo scheduling algorithm [22] to minimize step power, the difference in the power
consumed between two consecutive time steps, for a software pipelined loop. Their objective is to derive a schedule under
which power consumption is better balanced under a VLIW architecture. This is different from our objective which ....
....to reorder a schedule made by some performance-oriented scheduling algorithm to achieve energy saving with minimal
performance degradation. A relevant work by Yun et al. [20] targets power-aware software pipelining. They proposed a heuristic
algorithm that extends iterative modulo scheduling [13] and tries to minimize step power for a software pipelined loop on
a cycle-by-cycle basis. 6. Conclusions .... trade-offs in the design space of energy-efficient architectures. It studies the interplay
between low-power architecture features and compiler optimization techniques, specifically software ....
....a time are similar to early attempts at global acyclic scheduling which percolated operations from one basic block to the next
without any final intended destination. Moving an operation one iteration at a time may create a temporarily worse schedule
that can later be transformed into a better one [Rau94]. The problem is to distinguish between such good moves and shifts
that genuinely make the schedule worse. Despite the problems of loop shifting, it has one very important advantage: it extends
naturally to software pipelining loops containing branches. Loop shifting algorithms can pipeline .... 
....the loop. It can result in high performance code but increased register requirements [10]. Rau and Eichenberger have done
research on optimum modulo schedules, taking into consideration the minimum register requirement. They consider not only the
data flow, but also the control flow of the program [7, 18]. None of the above research efforts, however, includes the prefetching
idea or considers the data fetching latency in their algorithms. We will restrict our study to nested loops with uniform data
dependencies. Even if most loop nests have affine dependencies, the study of uniform loop nests is ....
....pipelining is an important compilation technique applied on loops to exploit instruction level parallelism. In the past, resource-
constrained software pipelining has been studied extensively by several researchers and a number of modulo
scheduling algorithms have been proposed in the literature [8, 16, 21, 31]. The objective of a software pipelining method is to
construct a schedule that satisfies both the resource constraints of the architecture and the dependence constraints imposed by
the program, such that the constructed schedule has a very low initiation interval (II). The schedule which achieves ....

....available resources. This applies not only to critical instructions, those that are on critical recurrence cycle(s) or those that use
critical resource(s), but also to all other instructions. In certain software pipelining methods, instructions are
scheduled at the earliest possible time [31]. However, issuing instructions as early as possible may schedule noncritical
instructions along with critical instructions at the same time step, requiring multiple instances of functional units to be active 
simultaneously. As explained earlier, since we assume a power model in which all or none .... 
....the loop. It can result in high performance code but increased register requirements [10]. Rau and Eichenberger have done
research on optimum modulo schedules, taking into consideration the minimum register requirement. They consider not only the
data flow, but also the control flow of the program [8, 18]. None of the above research efforts, however, includes the prefetching
idea or considers the data fetching latency in their algorithms.
DO 10 n1 = 1, N1
DO 20 n2 = 1, N2
y(n1, n2) = x(n1, n2) + c(0, 1) * y(n1, n2 - 1) + c(0, 2) * y(n1, n2 - 2) + c(1, 0) * y(n1 - 1, ....
....and StarCore 140) are based on a VLIW design paradigm, with good reason: VLIW offers wide issue (today up to eight
operations per cycle) with relatively little instruction issue overhead, clustering is natural and offers enhanced scalability
[1], and compiler techniques such as software pipelining [2] effectively employ the VLIW's many processing units in a
wide variety of loop kernels. In the embedded market, where power margins dictate use of the lowest possible clock frequency 
to achieve a given processing rate, cycles cannot be wasted waiting for branch resolution and instruction fetch 
....SCHEDULING OVERVIEW Software pipelining is an aggressive loop scheduling technique for VLIW processors. It transforms
a sequential loop so that new iterations can start before preceding ones finish, thus overlapping the execution of multiple iterations 
in a pipelined fashion. Modulo scheduling [6, 13] is one of the scheduling algorithms for implementing software pipelining. 
1 Since a large number of loops contain no conditionals, we concentrate on loops with no control flows in this paper. For loops 
with control flows, we assume a hardware mechanism that supports predicated execution 

....high level synthesis [4] and logic level synthesis [8]. 2 Such a graph is called a data flow graph (DFG) in the context of
synchronous VLSI circuits. 3 For simplicity, we assume that the functional units are fully pipelined. Complex resource
constraints can be handled by resource reservation table [13].
[Figure residue. Loop body operations: 1) r1 = op1(r3); 2) r2 = op2(r1,r5); 3) r3 = op3(r2); 4) r4 = op4(r3); 5) r5 = op5(r2);
6) r6 = op6(r6). The schedule tables of panels (a) to (d) are not recoverable.] ....
....4.1 Modulo Scheduling Because multiple iterations will be active simultaneously on the network of modules, care must be taken
so that nonqueue memory accesses from different iterations do not conflict. We utilize a scheduling algorithm directly based
on Rau's iterative modulo scheduling (IMS) [17]. Modulo scheduling is a framework for scheduling a single iteration of a loop in
....exploits instruction level parallelism (ILP) out of a loop by overlapping operations from various successive loop iterations.
Different approaches have been proposed in the literature [2] for the generation of software pipelined schedules. Some of them
mainly focus on achieving high throughput [1, 13, 18, 24, 25, 27]. The main drawback of these aggressive scheduling
techniques is their high register requirements [21, 23]. Using more registers than available requires some actions which reduce the
register pressure but may also degrade the performance (either due to the additional cycles in the schedule or due .... 

....phase. The II is bounded either by recurrence circuits in the dependence graph of the loop (RecMII) or by resource
constraints of the target architecture (ResMII). The lower bound on the II is termed the Minimum Initiation Interval (MII =
max(RecMII, ResMII)). The reader is referred to [13, 27] for an extensive dissertation on how to calculate RecMII and
ResMII. In order to perform software pipelining, the Hypernode Reduction Modulo Scheduling (HRMS) heuristic [22] is used.
HRMS is a software pipeliner that achieves the MII for a large percentage of the workbench considered in this ....
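
The bound MII = max(RecMII, ResMII) can be illustrated with a small sketch. It is not the procedure of [13, 27]: it assumes fully pipelined functional units (each operation holds a unit for one cycle) and that the recurrence circuits have already been enumerated as (total latency, total iteration distance) pairs. All function and parameter names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def res_mii(op_counts, unit_counts):
    """Resource bound: the busiest resource class dictates the interval.

    op_counts[r]   = operations per iteration that need resource r
    unit_counts[r] = number of units of resource r in the machine
    """
    return max(ceil_div(op_counts[r], unit_counts[r]) for r in op_counts)


def rec_mii(circuits):
    """Recurrence bound over circuits given as (latency_sum, distance_sum)."""
    return max((ceil_div(lat, dist) for lat, dist in circuits), default=1)


def mii(op_counts, unit_counts, circuits):
    """Lower bound on the initiation interval."""
    return max(rec_mii(circuits), res_mii(op_counts, unit_counts))


# Example: 3 memory ops on 1 port and 4 ALU ops on 2 ALUs give ResMII = 3;
# circuits with (latency 4, distance 1) and (latency 5, distance 2) give RecMII = 4.
example_mii = mii({"mem": 3, "alu": 4}, {"mem": 1, "alu": 2}, [(4, 1), (5, 2)])
```

Here the recurrence bound dominates, so no schedule for this hypothetical loop can start iterations more often than once every 4 cycles.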
....shown in Figure 3(b). The total number of pipe stages (i.e. iterations executing concurrently) on a software pipelined loop body
is denoted by P. The total number of execution steps required by any such (balanced) pipe stage corresponds to the
initiation interval (II) of the retimed loop body [4]. That is, a new iteration is started/concluded every II steps. For the example
in Figure 3(b), the initiation interval and the total number of pipe stages are II=2 and P=2, respectively. Naturally, the key objective
of software pipelining retiming is to decrease II, thus increasing the execution .... 
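
A short worked calculation shows why a lower II raises throughput. It assumes, as the passage states, balanced pipe stages of II steps each, so that one iteration spans P * II steps; the trip count of 10 and the function names are illustrative assumptions, not values from the paper.

```python
def pipelined_steps(n_iters, ii, stages):
    """Steps needed when a new iteration starts every `ii` steps.

    The last iteration starts at step (n_iters - 1) * ii and then needs
    `stages` pipe stages of `ii` steps each to drain.
    """
    if n_iters == 0:
        return 0
    return (n_iters - 1) * ii + stages * ii


def sequential_steps(n_iters, ii, stages):
    """Steps needed when iterations run back to back without overlap."""
    return n_iters * stages * ii


# With II = 2 and P = 2 (the values quoted for Figure 3(b)) and 10 iterations:
overlapped = pipelined_steps(10, ii=2, stages=2)     # (10 - 1) * 2 + 2 * 2
unoverlapped = sequential_steps(10, ii=2, stages=2)  # 10 * 2 * 2
```

For long-running loops the drain term becomes negligible and the cost per iteration approaches II steps.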
....niques have been proposed to efficiently exploit the parallelism available in iterative program constructs. Software pipelining [4],
[5] is a loop scheduling technique that extracts parallelism from loops by overlapping operations from various consecutive
iterations. Modulo scheduling [6], [7] is a class of software pipelining algorithms which has been incorporated in many
production compilers. In a modulo scheduled loop, the Initiation Interval (II) is the number of cycles between the initiation of 
successive iterations. For a loop, the lower the II the higher the number of operations .... 

....proven to be very effective [14]. In [15] the authors present an approach that improves the performance by simultaneously
performing instruction scheduling, register allocation, and register spilling. To achieve this, it uses the ability of some previous
iterative modulo scheduling techniques [6], [7], [10], [16] to backtrack, i.e. to undo previous scheduling decisions and
reschedule operations. Sections II. B and C overview the main proposals in this direction. On the other hand, the organization 
and management of the register file has been a subject of research in the past. The main idea .... 

B. R. Rau. "Iterative modulo scheduling: An algorithm for software pipelining loops," in Proc. of the 27th Annual International
Symposium on Microarchitecture, November 1994, pp. 63-74.
....exploit the ILP available in programs [4, 12, 20]. Loops are the main time consuming part of numerical programs. Software
pipelining [5, 14] is a loop scheduling technique that extracts parallelism from loops by overlapping operations from various 
consecutive iterations. Modulo scheduling [8, 22] is a class of software pipelining algorithms which has been incorporated 
in many production compilers. In a modulo scheduled loop, the Initiation Interval (II) is the number of cycles between the 
initiation of successive iterations. For a loop, the lower the II the higher the number of .... 

B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops, in Proc. of the 27th Annual International
Symposium on Microarchitecture, pages 63-74, November 1994.
....[1] for the generation of software pipelined schedules. Modulo scheduling [26] is a class of software pipelining algorithms that is
very cost effective and has been implemented in many production compilers. Most of the early modulo scheduling techniques
focused mainly on achieving high throughput [1 { 7, 25, 28]. However, one of the drawbacks of modulo scheduling (and
software pipelining in general) is that it increases the register requirements. This has motivated some recent modulo scheduling
approaches that not only try to maximize throughput but also try to minimize register requirements [6, 9, 16, .... 

....forcing a node in a particular cycle, the heuristic ejects nodes that cause resource conflicts with the forced node. If for a 
particular resource conflict several candidate nodes are possible, the heuristic selects the one that was first placed in the partial 
schedule S. Other iterative algorithms [6, 16, 28] eject all the operations that cause a resource conflict. In our iterative 
algorithm, only one is ejected. The heuristic also ejects all previously scheduled predecessors and successors whose 
dependence constraints are violated due to the enforced placement. Notice that all the unscheduled .... 

B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops, in Proc. of the 27th Annual International
Symposium on Microarchitecture, pages 63-74, November 1994.
....require a prohibitive time to construct the schedules and therefore their applicability is restricted to very small loops. Therefore,
practical algorithms use some heuristics to guide the scheduling process. Some of the proposals in the literature only care 
about achieving high throughput [11,19,20,31,32,37] while other proposals have also been targeted towards minimizing 
the register requirements [9,12,18,24] which result in more effective schedules. Stage Scheduling [12] is not a whole modulo 
scheduler by itself but a set of heuristics targeted to reduce the register requirements of any given .... 

....but lower register requirements. Unfortunately there are constraints in the movement of operations that might lead to
suboptimal reductions of the register requirements. Similar heuristics have been included in the IRIS [9] scheduler, which
is based on the Iterative Modulo Scheduling [11,31], in order to reduce the register pressure at the same time as the
scheduling is performed. Slack Scheduling [18] is a heuristic technique that simultaneously schedules some operations late and
other operations early with the aim of reducing the register requirements and achieving maximum .... 
Advanced Vector Architectures - Espasa (1997)



....needed. On top of that, program transformations such as loop blocking [PHH89, WL91, KM92, LRW91b, Li95, CM95] have 
proven very useful to fit the working set of a program into multilevel memory hierarchies. Related to data
caching, software pipelining [Lam88, GHW90, GAG94, Jai91, RLTS92, Ram94, Rau94] has also contributed to hide
memory latency and the penalties associated with cache misses by overlapping several iterations of a single loop. 

Decoupling Decoupled scalar processors [SWP86, Smi84, KHC94] have focused on numerical computation and attack the 
memory latency problem .... 

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the
ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San Francisco,
California, June 17-19, 1992. SIGPLAN Notices, 27(7), July 1992.
....more simultaneously live values exist than physical registers, spill code must be added and can significantly increase the
achieved II of the loop. In this case, it may be possible to achieve a better final II by increasing the candidate II and
attempting to schedule the original loop body again [26]. If a lower bound on the loop's final register requirement for a given II
were available, it would be useful during both optimization and scheduling. During optimization it could be used to stop
optimization before excessive register pressure is generated. During scheduling, the candidate II's ....

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker, "Register allocation for software pipelined loops," in Proceedings of the
ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pp. 283-299, June 1992.
....the successive outputs of an operation can be kept in distinct registers. In the absence of hardware support, the loop may
be unrolled and the duplicate register specifiers renamed appropriately [9]. However, this modulo variable expansion
technique can result in a large amount of code expansion [18]. A rotating register file can solve this problem without 
duplicating code. Consider saving the series of values generated by an operation in its own infinite pushdown stack. Old values 
can be read out of anywhere in the stack, and new values can be pushed on top, but a value cannot be modified .... 
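
The pushdown-stack view of a rotating register file can be sketched with a small model. This is an assumption-laden illustration rather than the design in the cited paper: it uses a finite file, maps logical to physical registers as (logical + base) mod size, and decrements the base once per iteration, which is the convention used by IA-64 style register rotation. The class and method names are hypothetical.

```python
class RotatingRegisterFile:
    """Toy rotating register file with a single rotating base."""

    def __init__(self, size):
        self.regs = [None] * size
        self.base = 0  # rotating register base

    def _physical(self, logical):
        return (logical + self.base) % len(self.regs)

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def rotate(self):
        """Advance to the next iteration: every value moves one logical name up."""
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(4)
for i in range(3):
    rrf.write(0, i * 10)  # every iteration writes the same logical register
    rrf.rotate()

# After three iterations the values are stacked newest first at logical 1, 2, 3.
newest, middle, oldest = rrf.read(1), rrf.read(2), rrf.read(3)
```

The loop body names only logical register 0, yet three successive values stay live in distinct physical registers, which is what removes the need to unroll and rename.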

....around a vector of length II. In any case, the LiveVector's maximum, MaxLive, is the desired lower bound. Allocating registers
for a modulo scheduled loop is beyond the scope of this paper. For an extensive discussion of the problem, including
heuristic solutions and empirical results, consult [18]. One of the most remarkable results reported in that paper is the ability
of their allocation strategies to almost always achieve the MaxLive lower bound on a schedule's register pressure 4. Due to that
result, this paper approximates a schedule's register pressure with its MaxLive lower bound
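
A minimal sketch of the MaxLive computation follows, assuming each value's lifetime is given as a half-open cycle interval [start, end) within one iteration's schedule; that interval convention and the function names are assumptions, not taken from the paper. Each lifetime is wrapped around a vector of length II, and the largest entry is MaxLive.

```python
def live_vector(lifetimes, ii):
    """Count live values per modulo row by wrapping lifetimes around length ii."""
    vector = [0] * ii
    for start, end in lifetimes:
        for cycle in range(start, end):
            vector[cycle % ii] += 1
    return vector


def max_live(lifetimes, ii):
    """Lower bound on the registers needed by a modulo schedule with this II."""
    return max(live_vector(lifetimes, ii))


# Three values in a hypothetical schedule with II = 2.
lifetimes = [(0, 3), (1, 2), (2, 5)]
vector = live_vector(lifetimes, ii=2)
bound = max_live(lifetimes, ii=2)
```

A lifetime longer than II contributes more than once to the same row, reflecting that instances of the value from several overlapped iterations are live at the same time.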

B. R. Rau, M. Lee, P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the
ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992.
....previous pfld just because its load pipeline has three stages. On the contrary, our architecture adopts an ordinary waiting
mechanism for requested data. Due to this fact, our architecture does not need serious changes in the architecture. Modulo
scheduling on rotating register files is proposed in [RLTS92]. In rotating register files, the logical register number is decoupled from the
physical register number. In this respect, rotating register files are similar to our slide-windowed registers. However, in rotating
register files, the total number of physical registers is not increased. Therefore, long memory .... 

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker, "Register Allocation for Software Pipelined Loops", Proc. ACM SIGPLAN '92
Conf. on Programming Language Design and Implementation, pp. 283-299, 1992
Register Allocation for Predicated Code - Eichenberger, Davidson (1995) (5 citations)

....a framework based on cyclic interval graphs, introducing the notion of time in the register allocator paradigm. This additional 
notion of time is particularly useful for the live ranges of a loop, where live ranges may cross the boundary of an iteration. Another 
approach, investigated by Rau et al. [12], proposes a general framework for the allocation of registers in software
pipelined loops for various code generation and hardware support schemes. The second contribution of this paper is a set 
of heuristics that reduces the register requirements by allowing non interfering virtual registers .... 

....For register allocators based on Chaitin's graph coloring framework [9], [10], register allocation for predicated code can be
achieved simply by using the refined interference graph instead of the conventional one. However, several register allocators
depart from the graph coloring method [11], [12], as graph coloring methods do not provide a notion of time that is
particularly useful for the live ranges of a loop, which may cross the boundary of an iteration. Also, nontraditional 
constraints such as the one presented in [12] to support various code generation and hardware support schemes are .... 


B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. PLDI, pages 283-299,
June 1992.

An Integrated Approach to Register Binding and Scheduling - Bart Mesman

....to satisfy the timing constraints, software pipelining [2], also called loop pipelining or loop folding, is required. Previously [15] we
showed that a heuristic like list scheduling for loop pipelining is unable to satisfy the timing and resource constraints even for
simple examples. Rau et al. [11] successfully perform register binding tuned to pipelined loops. They mention that for better
code quality "Concurrent scheduling and register allocation is preferable", but for reasons of run time efficiency they solve the
problem of scheduling and register binding in separate phases. Some .... 

B. R. Rau, M. Lee, P. P. Tirumalai and M. S. Schlansker, "Register allocation for software pipelined loops", Proc. of the SIGPLAN
'92 Conf. on Programming Language Design and Implementation, pp. 283-299, June 1992
....Huff's Slack Scheduling [9], Wang, Eisenbeis, Jourdan and Su's FRLC [23], and Gasperoni and Schwiegelshohn's modified list
scheduling [6]. Experimental results show that the method described in this paper performed significantly better than these
methods. 1 Introduction Software pipelining [1, 4, 9, 11, 12, 13, 17, 18, 22] has been proposed as an efficient method for
loop schedul.... This work was supported by research grants from NSERC (Canada) and MICRONET Network Centers of
Excellence (Canada). To appear in the Proceedings of the 27th Annual International Symposium on Microarchitecture
(MICRO-27), San Jose, ....

....in Section 7. 2 Exploiting the Space of Software Pipelined Schedules 2. 1 An Example We introduce the notion of rate 
optimal schedules under resource constraints, and illustrate how to search among them the ones which optimize the 
register usage with the help of a simple example loop taken from [18]. The loop L (in the C language) is: for (i = 0; i < n; i++) { s =
s + a[i]; a[i] = s * s * a[i]; } The dependence graph for the loop L is depicted in Figure 1. [Figure 1: Dependence Graph
of Loop L; nodes S0 to S5] Consider an architecture with 3 pipelined homogeneous function units. Assume ....
....is rarely gathered and exploited in the optimizer's strategy. There are isolated instances where this information is used to
good effect, such as when combining instruction scheduling and register allocation [3, 5, 6, 19, 20, 30, 33, 34] or software
pipelining and register allocation [16, 17, 21, 23, 27, 32, 38, 44]. While these techniques can improve program performance,
they focus narrowly on the interaction of a single pair of optimizations, rather than more generally on the entire collection of 
optimizations to be applied to a program. Provided that enough useful information can be gathered and analyzed .... 

....of balance among the levels of demand for specific machine resources of particular interest to the two phases, and the supply
and configuration of the target machine's resources. The most well known examples of this work focus on the interactions
between software pipelining and register allocation [16, 17, 21, 23, 27, 32, 38, 44], instruction scheduling and register
allocation [3, 5, 6, 19, 20, 30, 33, 34], instruction scheduling and cache usage [28], and scalar replacement and register
allocation [8]. All have in common the goal of creating a good match between the program characteristics, such as
instruction placement .... 

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In ACM SIGPLAN
Conference on Programming Language Design and Implementation, 1992.
....Chaitin's technique based on graph coloring [14]. Register allocation for software pipelined loops presents additional problems
leading to unconventional solutions. How to allocate registers for modulo scheduled loops is beyond the scope of this
paper (for an extensive discussion of the problem see [15]). The Wands Only strategy combined with the First Fit
allocation scheme has been chosen to allocate registers. Wands Only is the strategy that has the lowest empirical
complexity, and the one that obtains the results closest to optimal in terms of number of registers. For this strategy all the ....

....M3 and A4. The results of M3 are used by operation A4; since A4 has been scheduled in the left cluster, the values produced by M3
could be allocated as left-only values. The results of A4 are used by operation M5; since M5 has been scheduled ....

5 We have chosen this example because it is very simple to calculate the registers required by the schedule. For an extensive
discussion of the register allocation problem for software pipelined loops see [15].

VALUE        L1   L2   M3   A4   M5   A6
Allocation   GL   LO   LO   RO   RO   RO
Lifetime     13   7    6    6    6    4
Table 3: Allocation requirements of values for example loop
....to satisfy the timing constraints, software pipelining [2], also called loop pipelining or loop folding, is required. Previously [15] we
showed that a heuristic like list scheduling for loop pipelining is unable to satisfy the timing and resource constraints even for
simple examples. Rau et al. [11] successfully perform register binding tuned to pipelined loops. They mention that for better
code quality "Concurrent scheduling and register allocation is preferable", but for reasons of run time efficiency they solve the
problem of scheduling and register binding in separate phases. Some .... 
....three stages. Compared with the i860 architecture, our architecture includes an ordinary waiting mechanism for requested data and
successfully closes the growing gap between processor and memory speed without serious changes in the architecture. Modulo
scheduling on rotating register files is proposed in [RLTS92]. In rotating register files, the logical register number is decoupled from the
physical register number. This is similar to our slide-windowed registers. However, in rotating register files, the total number of
physical registers is not increased. Therefore, long memory access latency cannot be hidden. This .... 